Review on data-centric brain-inspired computing paradigms exploiting emerging memory devices
Biologically inspired neuromorphic computing paradigms are computational platforms that imitate the synaptic and neuronal activity of the human brain to process large data flows efficiently and cognitively. Over the past decades, neuromorphic computing has been widely investigated in application fields such as language translation, image recognition, phase modeling, and speech recognition, especially through neural networks (NNs) built on emerging nanotechnologies. Their inherent miniaturization and low power cost can alleviate the technical barriers that neuromorphic computing faces when implemented with traditional silicon technology in practical applications. In this work, we review recent advances in the development of brain-inspired computing (BIC) systems from a system designer's perspective, from the device and circuit levels up to the architecture and system levels. In particular, we sort out the NN architectures determined by the data structures centered on big data flows in application scenarios. Finally, the interactions between the system level and the architecture and circuit/device levels are discussed. This review can thus serve the future development and opportunities of BIC system design.
FourierPIM: High-Throughput In-Memory Fast Fourier Transform and Polynomial Multiplication
The Discrete Fourier Transform (DFT) is essential for various applications
ranging from signal processing to convolution and polynomial multiplication.
The groundbreaking Fast Fourier Transform (FFT) algorithm reduces DFT time
complexity from the naive O(n^2) to O(n log n), and recent works have sought
further acceleration through parallel architectures such as GPUs.
Unfortunately, accelerators such as GPUs cannot exploit their full computing
capabilities as memory access becomes the bottleneck. Therefore, this paper
accelerates the FFT algorithm using digital Processing-in-Memory (PIM)
architectures that shift computation into the memory by exploiting physical
devices capable of storage and logic (e.g., memristors). We propose an O(log n)
in-memory FFT algorithm that can also be performed in parallel across multiple
arrays for high-throughput batched execution, supporting both fixed-point and
floating-point numbers. Through the convolution theorem, we extend this
algorithm to O(log n) polynomial multiplication, a fundamental task for
applications such as cryptography. We evaluate FourierPIM on a
publicly available cycle-accurate simulator that verifies both correctness and
performance, and demonstrate 5-15x throughput and 4-13x energy improvements over
the NVIDIA cuFFT library on state-of-the-art GPUs for FFT and polynomial
multiplication.
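The convolution-theorem route described above can be sketched in software: multiply two polynomials by transforming their coefficient vectors, taking a pointwise product in the Fourier domain, and inverting the transform. This is a plain NumPy sketch of the mathematical idea only, not the in-memory algorithm; the function name `poly_mul_fft` is ours.

```python
import numpy as np

def poly_mul_fft(a, b):
    """Multiply two polynomials (coefficient lists, lowest degree first)
    via the convolution theorem: pointwise product in the Fourier domain."""
    n = len(a) + len(b) - 1           # length of the product polynomial
    size = 1 << (n - 1).bit_length()  # next power of two for the FFT
    fa = np.fft.fft(a, size)
    fb = np.fft.fft(b, size)
    prod = np.fft.ifft(fa * fb)[:n]
    return np.round(prod.real).astype(int)  # recover integer coefficients

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
print(poly_mul_fft([1, 2], [3, 4]))  # → [ 3 10  8]
```

The pointwise product is where a PIM architecture gains: each multiplication is independent, so it maps naturally onto parallel memory arrays.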
ClaPIM: Scalable Sequence CLAssification using Processing-In-Memory
DNA sequence classification is a fundamental task in computational biology
with vast implications for applications such as disease prevention and drug
design. Fast, high-quality sequence classifiers are therefore of great
importance. This paper introduces ClaPIM, a scalable DNA sequence classification
architecture based on the emerging concept of hybrid in-crossbar and
near-crossbar memristive processing-in-memory (PIM). We enable efficient and
high-quality classification by uniting the filter and search stages within a
single algorithm. Specifically, we propose a custom filtering technique that
drastically narrows the search space and a search approach that facilitates
approximate string matching through a distance function. ClaPIM is the first
PIM architecture for scalable approximate string matching that benefits from
the high density of memristive crossbar arrays and the massive computational
parallelism of PIM. Compared with Kraken2, a state-of-the-art software
classifier, ClaPIM provides significantly higher classification quality (up to
20x improvement in F1 score) and also demonstrates a 1.8x throughput
improvement. Compared with EDAM, a recently-proposed SRAM-based accelerator
that is restricted to small datasets, we observe both a 30.4x improvement in
normalized throughput per area and a 7% increase in classification precision.
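The filter-then-search flow can be illustrated with a minimal software analogue: a k-mer prefilter narrows the candidate set, then an edit-distance check performs the approximate match. The function names, the shared-k-mer filter, and the distance threshold are illustrative assumptions here, not ClaPIM's actual hardware pipeline.

```python
def edit_distance(s, t):
    # Classic dynamic-programming Levenshtein distance (the search stage's
    # distance function in this sketch).
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify(read, references, k=4, max_dist=2):
    """Filter-then-search: discard references sharing no k-mer with the
    read, then keep those within an edit-distance threshold."""
    candidates = [r for r in references if kmers(read, k) & kmers(r, k)]
    return [r for r in candidates if edit_distance(read, r) <= max_dist]

refs = ["ACGTACGT", "TTTTTTTT", "ACGTTCGT"]
print(classify("ACGTACGA", refs))  # → ['ACGTACGT', 'ACGTTCGT']
```

The point of the filter stage is visible even at this scale: the all-T reference is rejected by a cheap set intersection before the expensive distance computation runs.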
MTJ-Based Hardware Synapse Design for Quantized Deep Neural Networks
Quantized neural networks (QNNs) are being actively researched as a solution
for the computational complexity and memory intensity of deep neural networks.
This has sparked efforts to develop algorithms that support both inference and
training with quantized weight and activation values without sacrificing
accuracy. A recent example is the GXNOR framework for stochastic training of
ternary and binary neural networks. In this paper, we introduce a novel
hardware synapse circuit that uses magnetic tunnel junction (MTJ) devices to
support GXNOR training. Our solution enables processing near memory (PNM) of
QNNs and can therefore further reduce data movement to and from memory. We
simulated MTJ-based stochastic training of a TNN on the MNIST and SVHN datasets
and achieved accuracies of 98.61% and 93.99%, respectively.
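A rough software analogue of stochastic ternary training: the magnitude of the desired weight update sets the probability that a weight switches one state, much as an MTJ write is inherently probabilistic. This is a hedged sketch of the general idea, not the GXNOR algorithm or the paper's synapse circuit; the function name and update-rule details are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_ternary_update(w, grad, lr=0.1):
    """Probabilistically move ternary weights {-1, 0, +1} one state
    toward the negative gradient; the switching probability stands in
    for an MTJ's stochastic write."""
    step = -lr * grad                     # desired continuous update
    p = np.clip(np.abs(step), 0.0, 1.0)   # switching probability per weight
    flip = rng.random(w.shape) < p        # stochastic switching events
    return np.clip(w + flip * np.sign(step), -1, 1).astype(int)

w = np.array([0, 1, -1, 0])
g = np.array([-5.0, 2.0, 0.0, 0.1])
print(stochastic_ternary_update(w, g))   # weights stay in {-1, 0, +1}
```

Because the expected update equals the continuous step (for small steps), training converges on average while each individual write remains a cheap, noisy, single-state transition.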
Experimental Demonstration of Non-Stateful In-Memory Logic with 1T1R OxRAM Valence Change Mechanism Memristors
Processing-in-memory (PIM) is attractive to overcome the limitations of
modern computing systems. Numerous PIM systems exist, varying by the
technologies and logic techniques used. Successful operation of specific logic
functions is crucial for effective processing-in-memory. Memristive
non-stateful logic techniques are compatible with CMOS logic and can be
integrated into a 1T1R memory array, similar to commercial RRAM products. This
paper analyzes and demonstrates two non-stateful logic techniques: 1T1R logic
and scouting logic. As a first step, the 1T1R SiO_x valence change mechanism
memristors used are characterized with respect to their feasibility for
performing logic functions. Various logic functions of the two techniques are
experimentally demonstrated, showing correct functionality in all cases. Based
on these results, the challenges and limitations of the RRAM characteristics
and the 1T1R configuration for logic applications are discussed.
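Scouting logic evaluates a Boolean function by reading several cells at once and thresholding the summed bitline current. The behavioural sketch below uses illustrative conductance values and sense thresholds chosen by us, not measured device data from the paper.

```python
# Behavioural sketch of scouting logic: two 1T1R cells are read in
# parallel, their currents sum on the shared bitline, and a sense
# amplifier threshold decides the logic output.
G_LRS, G_HRS = 1.0, 0.01   # low/high resistive state conductances (a.u.)

def bitline_current(a, b, v_read=1.0):
    # Each stored bit selects the cell's conductance state.
    ga = G_LRS if a else G_HRS
    gb = G_LRS if b else G_HRS
    return v_read * (ga + gb)

def scouting_or(a, b):
    return bitline_current(a, b) > 0.5 * G_LRS   # one LRS cell suffices

def scouting_and(a, b):
    return bitline_current(a, b) > 1.5 * G_LRS   # both cells must be LRS

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(scouting_or(a, b)), int(scouting_and(a, b)))
```

Only the sense threshold differs between OR and AND, which is why scouting logic composes well with a conventional 1T1R read path: the array itself is never modified, so no destructive stateful write is needed.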